This report explores the relationships between various factors influencing student performance, using exploratory data analysis (EDA) to identify key trends and correlations. The analysis focuses on variables such as study habits, access to resources, parental involvement, and environmental factors, and how they impact final exam scores. Insights gained from the data will inform recommendations aimed at improving academic outcomes for students.
The dataset was sourced from Kaggle under the CC0 1.0 Universal “No Copyright” license. We are free to copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. Learn more about this license here.
URL for data in Kaggle: Student Performance Factors Dataset
student_data <- read.csv('../data/StudentPerformanceFactors.csv', header = TRUE)
student_data # Display the dataset
Hours_Studied Attendance Parental_Involvement Access_to_Resources
Min. : 1.00 Min. : 60.00 Length:6607 Length:6607
1st Qu.:16.00 1st Qu.: 70.00 Class :character Class :character
Median :20.00 Median : 80.00 Mode :character Mode :character
Mean :19.98 Mean : 79.98
3rd Qu.:24.00 3rd Qu.: 90.00
Max. :44.00 Max. :100.00
Extracurricular_Activities Sleep_Hours Previous_Scores
Length:6607 Min. : 4.000 Min. : 50.00
Class :character 1st Qu.: 6.000 1st Qu.: 63.00
Mode :character Median : 7.000 Median : 75.00
Mean : 7.029 Mean : 75.07
3rd Qu.: 8.000 3rd Qu.: 88.00
Max. :10.000 Max. :100.00
Motivation_Level Internet_Access Tutoring_Sessions Family_Income
Length:6607 Length:6607 Min. :0.000 Length:6607
Class :character Class :character 1st Qu.:1.000 Class :character
Mode :character Mode :character Median :1.000 Mode :character
Mean :1.494
3rd Qu.:2.000
Max. :8.000
Teacher_Quality School_Type Peer_Influence Physical_Activity
Length:6607 Length:6607 Length:6607 Min. :0.000
Class :character Class :character Class :character 1st Qu.:2.000
Mode :character Mode :character Mode :character Median :3.000
Mean :2.968
3rd Qu.:4.000
Max. :6.000
Learning_Disabilities Parental_Education_Level Distance_from_Home
Length:6607 Length:6607 Length:6607
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Gender Exam_Score
Length:6607 Min. : 55.00
Class :character 1st Qu.: 65.00
Mode :character Median : 67.00
Mean : 67.24
3rd Qu.: 69.00
Max. :101.00
'data.frame': 6607 obs. of 20 variables:
$ Hours_Studied : int 23 19 24 29 19 19 29 25 17 23 ...
$ Attendance : int 84 64 98 89 92 88 84 78 94 98 ...
$ Parental_Involvement : chr "Low" "Low" "Medium" "Low" ...
$ Access_to_Resources : chr "High" "Medium" "Medium" "Medium" ...
$ Extracurricular_Activities: chr "No" "No" "Yes" "Yes" ...
$ Sleep_Hours : int 7 8 7 8 6 8 7 6 6 8 ...
$ Previous_Scores : int 73 59 91 98 65 89 68 50 80 71 ...
$ Motivation_Level : chr "Low" "Low" "Medium" "Medium" ...
$ Internet_Access : chr "Yes" "Yes" "Yes" "Yes" ...
$ Tutoring_Sessions : int 0 2 2 1 3 3 1 1 0 0 ...
$ Family_Income : chr "Low" "Medium" "Medium" "Medium" ...
$ Teacher_Quality : chr "Medium" "Medium" "Medium" "Medium" ...
$ School_Type : chr "Public" "Public" "Public" "Public" ...
$ Peer_Influence : chr "Positive" "Negative" "Neutral" "Negative" ...
$ Physical_Activity : int 3 4 4 4 4 3 2 2 1 5 ...
$ Learning_Disabilities : chr "No" "No" "No" "No" ...
$ Parental_Education_Level : chr "High School" "College" "Postgraduate" "High School" ...
$ Distance_from_Home : chr "Near" "Moderate" "Near" "Moderate" ...
$ Gender : chr "Male" "Female" "Male" "Male" ...
$ Exam_Score : int 67 61 74 71 70 71 67 66 69 72 ...
[1] 0
We now check whether the dataset contains any missing values and remove them if necessary.
Hours_Studied Attendance
0 0
Parental_Involvement Access_to_Resources
0 0
Extracurricular_Activities Sleep_Hours
0 0
Previous_Scores Motivation_Level
0 0
Internet_Access Tutoring_Sessions
0 0
Family_Income Teacher_Quality
0 0
School_Type Peer_Influence
0 0
Physical_Activity Learning_Disabilities
0 0
Parental_Education_Level Distance_from_Home
0 0
Gender Exam_Score
0 0
We can see that there are no missing values in the dataset.
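The code behind the per-column counts is not shown in the report. A minimal sketch of how such counts could be produced, assuming a `colSums(is.na(...))` style check, is given below on a made-up data frame (`toy` and its values are hypothetical). It also shows why such a check does not flag empty strings.

```r
# Toy data frame standing in for student_data (values are made up).
toy <- data.frame(
  Hours_Studied   = c(23, 19, NA, 29),
  Teacher_Quality = c("Medium", "", "High", "Low"),
  stringsAsFactors = FALSE
)

# Number of NA entries per column.
na_counts <- colSums(is.na(toy))

# Number of empty-string entries per column; NA comparisons are skipped.
empty_counts <- colSums(toy == "", na.rm = TRUE)

print(na_counts)
print(empty_counts)
```

An empty string is a valid character value, so `is.na()` reports zero for it; the second count is what reveals blanks of that kind.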
We now check the unique values in each categorical column to identify any inconsistencies.
# Unique values in each categorical column
lapply(student_data[, sapply(student_data, is.character)], unique)
$Parental_Involvement
[1] "Low" "Medium" "High"
$Access_to_Resources
[1] "High" "Medium" "Low"
$Extracurricular_Activities
[1] "No" "Yes"
$Motivation_Level
[1] "Low" "Medium" "High"
$Internet_Access
[1] "Yes" "No"
$Family_Income
[1] "Low" "Medium" "High"
$Teacher_Quality
[1] "Medium" "High" "Low" ""
$School_Type
[1] "Public" "Private"
$Peer_Influence
[1] "Positive" "Negative" "Neutral"
$Learning_Disabilities
[1] "No" "Yes"
$Parental_Education_Level
[1] "High School" "College" "Postgraduate" ""
$Distance_from_Home
[1] "Near" "Moderate" "Far" ""
$Gender
[1] "Male" "Female"
From the results above, we can see that Teacher_Quality, Parental_Education_Level, and Distance_from_Home contain empty strings (""), which are missing values that the earlier check did not catch. We will now investigate how many rows are affected in each column.
# Check for missing values in Teacher_Quality
teacher_quality_missing <- student_data[student_data$Teacher_Quality == "",]
nrow(teacher_quality_missing)
[1] 78
We can see that only 78 rows have missing values in the Teacher_Quality column. We will now investigate the Parental_Education_Level column.
# Check for missing values in Parental_Education_Level
parental_education_level_missing <- student_data[student_data$Parental_Education_Level == "",]
nrow(parental_education_level_missing)
[1] 90
We can see that 90 rows have missing values in the Parental_Education_Level column. We will now investigate the Distance_from_Home column.
# Check for missing values in Distance_from_Home
distance_from_home_missing <- student_data[student_data$Distance_from_Home == "",]
nrow(distance_from_home_missing)
[1] 67
We can see that 67 rows have missing values in the Distance_from_Home column. Together, the three columns account for at most 78 + 90 + 67 = 235 affected rows, which is about 3.6% of the 6,607 observations. We will remove these rows from the dataset.
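A row can be blank in more than one column, so the per-column counts only give an upper bound on the number of rows removed. The sketch below counts affected rows directly; `toy` is a made-up data frame that borrows the three column names, and the helper variables are illustrative.

```r
# Toy data frame with blanks in some of the three columns (values are made up).
toy <- data.frame(
  Teacher_Quality          = c("Medium", "", "High", "Low", "High"),
  Parental_Education_Level = c("College", "", "", "High School", "College"),
  Distance_from_Home       = c("Near", "Far", "Near", "Near", "Moderate"),
  stringsAsFactors = FALSE
)

# TRUE for every row that has at least one empty string.
has_empty <- rowSums(toy == "") > 0

n_affected     <- sum(has_empty)   # number of rows that would be dropped
share_affected <- mean(has_empty)  # the same figure as a share of all rows

# Keep only the complete rows.
cleaned <- toy[!has_empty, ]

print(n_affected)
print(share_affected)
```

In this toy example the second row is blank in two columns but is counted once.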
# Remove rows with missing values
student_data <- subset(student_data, Teacher_Quality != "" & Parental_Education_Level != "" & Distance_from_Home != "")
# Check for missing values
lapply(student_data[, sapply(student_data, is.character)], unique)
$Parental_Involvement
[1] "Low" "Medium" "High"
$Access_to_Resources
[1] "High" "Medium" "Low"
$Extracurricular_Activities
[1] "No" "Yes"
$Motivation_Level
[1] "Low" "Medium" "High"
$Internet_Access
[1] "Yes" "No"
$Family_Income
[1] "Low" "Medium" "High"
$Teacher_Quality
[1] "Medium" "High" "Low"
$School_Type
[1] "Public" "Private"
$Peer_Influence
[1] "Positive" "Negative" "Neutral"
$Learning_Disabilities
[1] "No" "Yes"
$Parental_Education_Level
[1] "High School" "College" "Postgraduate"
$Distance_from_Home
[1] "Near" "Moderate" "Far"
$Gender
[1] "Male" "Female"
We now investigate the dependent variable, Exam_Score, to identify any outliers.
# Outliers in Exam_Score
outliers_in_exam_score <- student_data[student_data$Exam_Score > 100,]
nrow(outliers_in_exam_score)
[1] 1
We can see that only one student has an exam score of 101, which lies above the 100-point threshold checked above and is treated as an outlier. We will remove this row from the dataset.
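The removal step itself is not shown in the report. One way it could be written is sketched below on a made-up data frame (`toy` and `cleaned` are hypothetical names); the author's actual code may differ.

```r
# Toy data frame with one score above 100 (values are made up).
toy <- data.frame(Exam_Score = c(67, 101, 74, 100))

# Keep only rows whose score does not exceed 100.
# drop = FALSE keeps the result a data frame even with a single column.
cleaned <- toy[toy$Exam_Score <= 100, , drop = FALSE]

print(nrow(cleaned))
print(max(cleaned$Exam_Score))
```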
Now that the data has been cleaned, we can proceed with the exploratory data analysis.
Here we explore the distribution of final exam scores among students without considering other factors. To examine the distribution, we first draw a random sample of 100 scores and plot a histogram.
# Sample the data
set.seed(123)
exam_score_sample <- student_data$Exam_Score[sample(nrow(student_data), 100)]
exam_score_sample
 [1] 66 70 72 68 62 69 72 66 58 69 63 67 70 64 65 70 75 70 69 69 68 63 75 69 68
[26] 66 68 70 69 64 60 67 66 70 69 65 65 69 69 66 64 64 66 66 66 72 61 71 66 65
[51] 63 69 70 73 66 70 64 68 71 69 63 68 63 65 70 66 71 71 87 72 67 66 71 64 67
[76] 63 72 64 68 66 75 70 64 67 65 66 63 69 68 65 68 65 61 71 69 68 66 61 59 65
# Plot histogram
hist(exam_score_sample, main = "Distribution of Final Exam Scores", xlab = "Final Exam Score", col = "skyblue", border = "black")
From the histogram, we can see that the distribution of final exam scores is approximately normal. We now plot a boxplot to visualize the spread of scores and identify any outliers.
# Boxplot of final exam scores
boxplot(exam_score_sample, main = "Boxplot of Final Exam Scores", col = "skyblue", border = "black")
The boxplot shows that the final exam scores are centered around a median of 67, with a single outlier at the upper end of the scale (a score of 87 in the sample printed above). Now we use numerical methods to test the normality of the distribution.
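The points a default boxplot draws as outliers can also be listed numerically with `boxplot.stats()`, which uses the same 1.5 × IQR rule. The sketch below applies it to the 100 sampled scores printed above; the variable name `box_summary` is illustrative.

```r
# The 100 sampled exam scores as printed in the report.
exam_score_sample <- c(
  66, 70, 72, 68, 62, 69, 72, 66, 58, 69, 63, 67, 70, 64, 65, 70, 75, 70, 69, 69, 68, 63, 75, 69, 68,
  66, 68, 70, 69, 64, 60, 67, 66, 70, 69, 65, 65, 69, 69, 66, 64, 64, 66, 66, 66, 72, 61, 71, 66, 65,
  63, 69, 70, 73, 66, 70, 64, 68, 71, 69, 63, 68, 63, 65, 70, 66, 71, 71, 87, 72, 67, 66, 71, 64, 67,
  63, 72, 64, 68, 66, 75, 70, 64, 67, 65, 66, 63, 69, 68, 65, 68, 65, 61, 71, 69, 68, 66, 61, 59, 65
)

# Five-number summary (whiskers, hinges, median) and the flagged outliers.
box_summary <- boxplot.stats(exam_score_sample)

print(box_summary$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(box_summary$out)    # values beyond 1.5 * IQR from the hinges
```

With hinges at 65 and 70, the fences sit at 57.5 and 77.5, so the lowest score of 58 stays inside the whiskers and only 87 is flagged.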
Shapiro-Wilk normality test
data: exam_score_sample
W = 0.92839, p-value = 4.035e-05
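The Shapiro-Wilk test rejects the null hypothesis of normality (W = 0.92839, p-value = 4.035e-05, which is less than 0.05). The sampled final exam scores are therefore not normally distributed, even though the histogram looks roughly bell-shaped.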
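The call that produced this output is not shown in the report. A minimal sketch of a Shapiro-Wilk test and its decision rule is given below on simulated data (`skewed_sample` is a made-up, strongly right-skewed sample, not the exam scores). The null hypothesis is that the data come from a normal distribution, so a small p-value is evidence against normality.

```r
set.seed(1)

# Simulated right-skewed data standing in for a sample of 100 values.
skewed_sample <- rexp(100)

# shapiro.test() returns an object whose p.value element holds the p-value.
result <- shapiro.test(skewed_sample)

# Reject normality at the 5% level when the p-value is below 0.05.
reject_normality <- result$p.value < 0.05

print(result)
print(reject_normality)
```

A p-value above 0.05 does not prove normality; it only means the test found no significant departure from it.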
Next, we explore the distribution of final exam scores based on parental involvement levels. We will plot histograms to compare the scores of students with different levels of parental involvement.
# Sample the data
high_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of final exam scores by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
The three histograms show the distribution of final exam scores for students with high, medium, and low levels of parental involvement. We can see that the distribution of the scores seems to be similar across all three categories. They seem to follow a normal distribution, with a slight skew towards higher scores for students with high parental involvement. We now use numerical methods to test the normality of the distributions.
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.97588, p-value = 0.06325
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.97777, p-value = 0.08891
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.98902, p-value = 0.5864
The Shapiro-Wilk test does not reject normality for any of the three groups: the p-values for high (0.06325), medium (0.08891), and low (0.5864) parental involvement are all greater than 0.05. The sampled scores in each group are therefore consistent with a normal distribution.
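Running the test once per group and collecting the p-values side by side makes it harder to misread an individual result. A minimal sketch, assuming simulated scores in a made-up data frame `toy` that borrows the report's column names:

```r
set.seed(42)

# Simulated scores for three groups of 100 (values are made up).
toy <- data.frame(
  Exam_Score = round(rnorm(300, mean = 67, sd = 4)),
  Parental_Involvement = rep(c("Low", "Medium", "High"), each = 100)
)

# Split the scores by group and extract one Shapiro-Wilk p-value per group.
p_values <- sapply(
  split(toy$Exam_Score, toy$Parental_Involvement),
  function(scores) shapiro.test(scores)$p.value
)

print(round(p_values, 4))
print(p_values > 0.05)  # TRUE means normality is not rejected at the 5% level
```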
We now take a closer look at the distribution of final exam scores for students with low parental involvement, starting with a density plot.
# Density plot of final exam scores for students with low parental involvement
plot(density(low_parental_involvement), main = "Density Plot of Final Exam Scores for Low Parental Involvement", xlab = "Final Exam Score", col = "skyblue")
The density plot shows that the distribution of final exam scores is slightly skewed to the left for students with low parental involvement. We will now create a QQ plot to compare the distribution of scores to a normal distribution.
# QQ plot of final exam scores for students with low parental involvement
qqnorm(low_parental_involvement, main = "QQ Plot of Final Exam Scores for Low Parental Involvement", col = "skyblue")
qqline(low_parental_involvement, col = "red")
The QQ plot suggests that the distribution of final exam scores for students with low parental involvement is slightly skewed to the left. The deviation is small, which agrees with the Shapiro-Wilk p-value of 0.5864 reported above for this group.
Next, we explore the distribution of final exam scores based on access to resources. We will plot histograms and density plots to compare the scores of students with different levels of access to resources.
# Sample the data
high_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of final exam scores by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources, main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_access_to_resources, main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_access_to_resources, main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
plot(density(low_access_to_resources), main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(medium_access_to_resources), main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(high_access_to_resources), main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue")
The histograms and density plots show the distribution of final exam scores for students with high, medium, and low levels of access to resources. The distributions seem to be similar across all three categories, with a slight skew towards higher scores for students with high access to resources. We will now use numerical methods to test the normality of the distributions.
Shapiro-Wilk normality test
data: high_access_to_resources
W = 0.97748, p-value = 0.0844
Shapiro-Wilk normality test
data: medium_access_to_resources
W = 0.97329, p-value = 0.03968
Shapiro-Wilk normality test
data: low_access_to_resources
W = 0.97248, p-value = 0.0343
The Shapiro-Wilk test rejects normality for the medium (p-value = 0.03968) and low (p-value = 0.0343) access groups, whose p-values are less than 0.05. For the high access group, the p-value of 0.0844 is greater than 0.05, so normality is not rejected. Final exam scores therefore depart from normality for students with medium and low access to resources.
Next, we explore the distribution of final exam scores based on participation in extracurricular activities. We will create a boxplot to compare the scores of students who participate in extracurricular activities and those who do not.
# Sample the data
participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Boxplot of final exam scores by extracurricular activities
boxplot(student_data$Exam_Score ~ student_data$Extracurricular_Activities, main = "Final Exam Scores by Extracurricular Activities", xlab = "Extracurricular Activities", ylab = "Final Exam Score", col = "skyblue", border = "black")
The boxplot shows that students who participate in extracurricular activities tend to have higher final exam scores compared to those who do not. Now we will visualize the distribution of scores for both groups using histograms.
# Histogram of final exam scores by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular, main = "Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular, main = "No Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
Both histograms show right-skewed distributions, with students who participate in extracurricular activities having higher final exam scores. We will now use numerical methods to test the normality of the distributions with the following hypothesis test.
Shapiro-Wilk normality test
data: participate_extracurricular
W = 0.97923, p-value = 0.1158
The Shapiro-Wilk test does not reject the null hypothesis of normality for students who participate in extracurricular activities: the p-value of 0.1158 is greater than 0.05. The sampled scores for this group are therefore consistent with a normal distribution.
Next, we explore the distribution of final exam scores based on motivation levels. We will create histograms to compare the scores of students with different motivation levels.
# Sample the data
# The subset and the sample size must refer to the same motivation level;
# otherwise the sampled indices can run past the subset and produce NA values.
high_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "High"][sample(sum(student_data$Motivation_Level == "High"), 100)]
medium_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Medium"][sample(sum(student_data$Motivation_Level == "Medium"), 100)]
low_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Low"][sample(sum(student_data$Motivation_Level == "Low"), 100)]
# Histogram of final exam scores by motivation level
par(mfrow = c(1, 3))
hist(high_motivation, main = "High Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_motivation, main = "Medium Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_motivation, main = "Low Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
The histograms for high and medium motivation levels show a normal distribution of final exam scores, while the low motivation level histogram shows a right-skewed distribution. We will now use numerical methods to test the normality of the distributions.
Shapiro-Wilk normality test
data: high_motivation
W = 0.94412, p-value = 0.0003468
Shapiro-Wilk normality test
data: medium_motivation
W = 0.94732, p-value = 0.07268
Shapiro-Wilk normality test
data: low_motivation
W = 0.83108, p-value = 5.376e-07
The Shapiro-Wilk test rejects normality for the high (p-value = 0.0003468) and low (p-value = 5.376e-07) motivation samples, whose p-values are less than 0.05. Only the medium motivation sample has a p-value greater than 0.05 (0.07268), so normality is not rejected for that group.
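The subset-then-sample pattern is repeated for every factor in this report, and a mismatch between the subset and the sample size is easy to introduce when copying it. One option is to wrap the pattern in a small helper. The function `sample_group()` below is hypothetical, not part of the original analysis, and `toy` is a made-up data frame.

```r
# Hypothetical helper: draw n values of value_col from the rows where
# group_col equals level, sampling without replacement.
sample_group <- function(data, group_col, level, value_col, n = 100) {
  values <- data[[value_col]][data[[group_col]] == level]
  stopifnot(length(values) >= n)  # fail loudly instead of returning NA values
  values[sample(length(values), n)]
}

set.seed(123)

# Toy data: 20 scores with motivation levels cycling through the rows.
toy <- data.frame(
  Exam_Score = 60:79,
  Motivation_Level = rep(c("Low", "Medium", "High", "Medium"), times = 5)
)

medium_sample <- sample_group(toy, "Motivation_Level", "Medium", "Exam_Score", n = 5)
print(medium_sample)
```

Because the level is passed once, the subset and the index range always come from the same group.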
Next, we explore the distribution of final exam scores based on internet access. We will create a boxplot to compare the scores of students with and without internet access.
# Sample the data
internet_access <- student_data$Exam_Score[student_data$Internet_Access == "Yes"][sample(sum(student_data$Internet_Access == "Yes"), 100)]
no_internet_access <- student_data$Exam_Score[student_data$Internet_Access == "No"][sample(sum(student_data$Internet_Access == "No"), 100)]
# Boxplot of final exam scores by internet access
boxplot(student_data$Exam_Score ~ student_data$Internet_Access, main = "Final Exam Scores by Internet Access", xlab = "Internet Access", ylab = "Final Exam Score", col = "skyblue", border = "black")
# Histogram of final exam scores by internet access
par(mfrow = c(1, 2))
hist(internet_access, main = "Internet Access", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(no_internet_access, main = "No Internet Access", xlab = "Final Exam Score", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: internet_access
W = 0.96856, p-value = 0.01718
Shapiro-Wilk normality test
data: no_internet_access
W = 0.93958, p-value = 0.0001819
The boxplot shows that students with internet access tend to have higher final exam scores compared to those without internet access. Although the histogram for students without internet access looks roughly normal, the Shapiro-Wilk test rejects normality for both groups: the p-values for students with internet access (0.01718) and without internet access (0.0001819) are both less than 0.05.
In summary, the histogram of final exam scores for all students looks roughly bell-shaped, but the Shapiro-Wilk test rejects normality for the sampled scores. When the scores are examined by parental involvement, access to resources, extracurricular activities, motivation level, and internet access, the distributions vary, and several groups depart from normality. In the plots, students with high parental involvement and high access to resources tend to have higher final exam scores, and students who participate in extracurricular activities also tend to perform better. Motivation level also appears to play a role, with highly motivated students tending to achieve higher scores.
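Visual impressions such as "higher scores for high parental involvement" can be backed by group means. A minimal sketch using `aggregate()` on a made-up data frame (`toy` and its values are hypothetical and do not reproduce the report's data):

```r
# Toy data: two students per involvement level (values are made up).
toy <- data.frame(
  Exam_Score = c(70, 72, 66, 68, 64, 65),
  Parental_Involvement = c("High", "High", "Medium", "Medium", "Low", "Low")
)

# Mean exam score for each level of parental involvement.
group_means <- aggregate(Exam_Score ~ Parental_Involvement, data = toy, FUN = mean)

print(group_means)
```

The same call works for any of the categorical factors by swapping the column on the right-hand side of the formula.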
Next, we explore the average number of hours students study per week without considering other factors.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.98 24.00 44.00
The summary statistics show that the average number of hours students study per week is approximately 19.98 hours. We will now visualize the distribution of study hours using a histogram.
# Histogram of study hours
hist(student_data$Hours_Studied, main = "Distribution of Study Hours", xlab = "Study Hours", col = "skyblue", border = "black")
The distribution of study hours is approximately normal. We will now use numerical methods to test the normality of the distribution.
# Shapiro-Wilk test for normality
sample_hours_studied <- student_data$Hours_Studied[sample(nrow(student_data), 100)]
shapiro.test(sample_hours_studied)
Shapiro-Wilk normality test
data: sample_hours_studied
W = 0.99057, p-value = 0.7103
The p-value is greater than 0.05, indicating that the distribution of study hours is approximately normal.
Next, we will explore the average number of hours students study per week based on parental involvement levels.
# Summary statistics for study hours by parental involvement
summary(student_data$Hours_Studied[student_data$Parental_Involvement == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.86 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.11 24.00 38.00
The summary statistics for weekly study hours are nearly identical across parental involvement levels, with group means of 19.86, 19.99, and 20.11 hours. We will now use ANOVA to test formally for differences in study hours based on parental involvement levels. But first, we need to check the assumptions of ANOVA.
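The report shows the `summary()` call for one level only. The per-level summaries can be produced in a single call with `tapply()`, as sketched below on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data: three students per involvement level (values are made up).
toy <- data.frame(
  Hours_Studied = c(10, 20, 30, 12, 18, 24, 8, 16, 32),
  Parental_Involvement = rep(c("High", "Medium", "Low"), each = 3)
)

# Six-number summary of study hours for each level.
summary_by_level <- tapply(toy$Hours_Studied, toy$Parental_Involvement, summary)

# Group means only, as a named numeric vector.
group_means <- tapply(toy$Hours_Studied, toy$Parental_Involvement, mean)

print(summary_by_level)
print(group_means)
```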
# Sample the data
high_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of study hours by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.9827, p-value = 0.2145
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.99085, p-value = 0.7331
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.986, p-value = 0.3737
Both the histograms and the Shapiro-Wilk test indicate that the distribution of study hours is approximately normal in all three parental involvement groups, with all p-values greater than 0.05. We will now use ANOVA to test for differences in study hours based on parental involvement levels.
# ANOVA test for study hours by parental involvement
anova_parental_involvement <- aov(Hours_Studied ~ Parental_Involvement, data = student_data)
summary(anova_parental_involvement)
                       Df Sum Sq Mean Sq F value Pr(>F)
Parental_Involvement    2     49   24.44   0.682  0.506
Residuals            6374 228362   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on parental involvement levels, with a p-value greater than 0.05. This indicates that parental involvement does not have a significant impact on the number of hours students study per week.
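Besides normality within groups, one-way ANOVA also assumes roughly equal variances across groups, which the report does not check. Bartlett's test is one base R option, sketched below on simulated data (`toy` is made up; the choice of Bartlett's test is an assumption, and it is itself sensitive to non-normality).

```r
set.seed(7)

# Simulated study hours for three groups with the same spread (values are made up).
toy <- data.frame(
  Hours_Studied = rnorm(150, mean = 20, sd = 6),
  Parental_Involvement = rep(c("Low", "Medium", "High"), each = 50)
)

# Null hypothesis: all groups have the same variance.
variance_check <- bartlett.test(Hours_Studied ~ Parental_Involvement, data = toy)

print(variance_check)
print(variance_check$p.value > 0.05)  # TRUE means equal variances are not rejected
```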
We now explore the average number of hours students study per week based on access to resources.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.02 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 16.0 20.0 19.9 24.0 43.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.12 24.00 44.00
The summary statistics for weekly study hours are nearly identical across levels of access to resources, with group means of 20.02, 19.9, and 20.12 hours. We will now use ANOVA to test formally for differences in study hours based on access to resources. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of study hours by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources, main = "High Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_access_to_resources, main = "Medium Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_access_to_resources, main = "Low Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_access_to_resources
W = 0.98025, p-value = 0.139
Shapiro-Wilk normality test
data: medium_access_to_resources
W = 0.96582, p-value = 0.01068
Shapiro-Wilk normality test
data: low_access_to_resources
W = 0.97123, p-value = 0.02747
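The Shapiro-Wilk test does not reject normality for the high access group (p-value = 0.139), but it does reject normality for the medium (p-value = 0.01068) and low (p-value = 0.02747) access groups. With a dataset of this size, ANOVA is generally robust to moderate departures from normality, so we proceed with ANOVA to test for differences in study hours based on access to resources, but the result should be read with some caution.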
# ANOVA test for study hours by access to resources
anova_access_to_resources <- aov(Hours_Studied ~ Access_to_Resources, data = student_data)
summary(anova_access_to_resources)
                      Df Sum Sq Mean Sq F value Pr(>F)
Access_to_Resources    2     48   24.12   0.673   0.51
Residuals           6374 228363   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on access to resources, with a p-value greater than 0.05. This indicates that access to resources does not have a significant impact on the number of hours students study per week.
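When normality is in doubt for some groups, a rank-based test can serve as a cross-check on the ANOVA result. The Kruskal-Wallis test is one such option in base R; it is not used in the original analysis and is shown here only as a sketch on simulated data (`toy` is made up).

```r
set.seed(11)

# Simulated study hours for three access levels (values are made up).
toy <- data.frame(
  Hours_Studied = round(rnorm(150, mean = 20, sd = 6)),
  Access_to_Resources = rep(c("Low", "Medium", "High"), each = 50)
)

# Null hypothesis: the groups come from the same distribution.
kw_result <- kruskal.test(Hours_Studied ~ Access_to_Resources, data = toy)

print(kw_result)
```

If the ANOVA and the Kruskal-Wallis test agree, the conclusion does not hinge on the normality assumption.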
Next, we explore the average number of hours students study per week based on participation in extracurricular activities.
# Summary statistics for study hours by extracurricular activities
summary(student_data$Hours_Studied[student_data$Extracurricular_Activities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.93 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.04 24.00 44.00
The summary statistics for weekly study hours are nearly identical for students who do and do not participate in extracurricular activities, with group means of 19.93 and 20.04 hours. We will now use ANOVA to test formally for differences in study hours based on participation in extracurricular activities. But first, we need to check the assumptions of ANOVA.
# Sample the data
participate_extracurricular <- student_data$Hours_Studied[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular <- student_data$Hours_Studied[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Histogram of study hours by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular, main = "Extracurricular Activities", xlab = "Study Hours", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular, main = "No Extracurricular Activities", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: participate_extracurricular
W = 0.98586, p-value = 0.3655
Shapiro-Wilk normality test
data: do_not_participate_extracurricular
W = 0.98909, p-value = 0.5916
Both the histograms and the Shapiro-Wilk test indicate that the distribution of study hours is approximately normal in both groups, with p-values greater than 0.05. We will now use ANOVA to test for differences in study hours based on participation in extracurricular activities.
# ANOVA test for study hours by extracurricular activities
anova_extracurricular_activities <- aov(Hours_Studied ~ Extracurricular_Activities, data = student_data)
summary(anova_extracurricular_activities)
                             Df Sum Sq Mean Sq F value Pr(>F)
Extracurricular_Activities    1     17   16.62   0.464  0.496
Residuals                  6375 228395   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on participation in extracurricular activities, with a p-value greater than 0.05. This indicates that participation in extracurricular activities does not have a significant impact on the number of hours students study per week.
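With only two groups, a one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances: the F statistic equals the squared t statistic and the p-values coincide. The sketch below checks this on simulated data (`toy` is made up).

```r
set.seed(3)

# Simulated study hours for two groups of 30 (values are made up).
toy <- data.frame(
  Hours_Studied = rnorm(60, mean = 20, sd = 6),
  Extracurricular_Activities = rep(c("Yes", "No"), each = 30)
)

# F statistic and p-value from the one-way ANOVA table.
anova_table <- summary(aov(Hours_Studied ~ Extracurricular_Activities, data = toy))[[1]]
anova_f <- anova_table[["F value"]][1]
anova_p <- anova_table[["Pr(>F)"]][1]

# Pooled-variance two-sample t-test on the same data.
t_result <- t.test(Hours_Studied ~ Extracurricular_Activities, data = toy, var.equal = TRUE)
t_squared <- unname(t_result$statistic)^2

print(c(F = anova_f, t_squared = t_squared))
print(c(anova_p = anova_p, t_test_p = t_result$p.value))
```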
Next, we explore the average number of hours students study per week based on motivation levels.
# Summary statistics for study hours by motivation level
summary(student_data$Hours_Studied[student_data$Motivation_Level == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.75 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.06 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.98 24.00 44.00
The summary statistics for weekly study hours are nearly identical across motivation levels, with group means of 19.75, 20.06, and 19.98 hours. We will now use ANOVA to test formally for differences in study hours based on motivation levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "High"][sample(sum(student_data$Motivation_Level == "High"), 100)]
medium_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "Medium"][sample(sum(student_data$Motivation_Level == "Medium"), 100)]
low_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "Low"][sample(sum(student_data$Motivation_Level == "Low"), 100)]
# Histogram of study hours by motivation level
par(mfrow = c(1, 3))
hist(high_motivation, main = "High Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_motivation, main = "Medium Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_motivation, main = "Low Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_motivation
W = 0.98676, p-value = 0.4217
Shapiro-Wilk normality test
data: medium_motivation
W = 0.9649, p-value = 0.009127
Shapiro-Wilk normality test
data: low_motivation
W = 0.99065, p-value = 0.717
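The Shapiro-Wilk test does not reject normality for the high and low motivation samples (p = 0.42 and p = 0.72), but it does reject normality at the 0.05 level for the medium motivation sample (p = 0.009). ANOVA is reasonably robust to moderate departures from normality when the dataset is large, as it is here with several thousand observations, so we will now use ANOVA to test for differences in study hours based on motivation levels.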
# ANOVA test for study hours by motivation level
anova_motivation_level <- aov(Hours_Studied ~ Motivation_Level, data = student_data)
summary(anova_motivation_level)
Df Sum Sq Mean Sq F value Pr(>F)
Motivation_Level 2 90 45.14 1.26 0.284
Residuals 6374 228321 35.82
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on motivation levels, with a p-value greater than 0.05. This indicates that motivation levels do not have a significant impact on the number of hours students study per week.
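A non-significant p-value does not by itself say how small the difference is. As a rough illustration, eta squared (the between-group sum of squares divided by the total sum of squares) can be computed from the values printed in the ANOVA table above. The variable names below are illustrative, and the inputs are the rounded printed values, so the result is only approximate.

```r
# Sums of squares taken from the printed ANOVA table (rounded output values)
ss_between <- 2 * 45.14   # Df * Mean Sq for Motivation_Level
ss_within  <- 228321      # Sum Sq for Residuals

# Eta squared: share of the total variance in study hours explained by the factor
eta_squared <- ss_between / (ss_between + ss_within)
print(eta_squared)
```

The result is roughly 0.0004, which suggests that motivation level accounts for about 0.04% of the variance in study hours in this table.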
Next, we explore the average number of hours students study per week based on internet access.
# Summary statistics for study hours by internet access
summary(student_data$Hours_Studied[student_data$Internet_Access == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 15.00 20.00 19.83 24.00 37.00
The summary statistics show that weekly study hours are very similar for students with and without internet access, with group means of 19.99 and 19.83 hours. We will now use ANOVA to test for differences in study hours based on internet access. But first, we need to check the assumptions of ANOVA.
# Sample the data
internet_access <- student_data$Hours_Studied[student_data$Internet_Access == "Yes"][sample(sum(student_data$Internet_Access == "Yes"), 100)]
no_internet_access <- student_data$Hours_Studied[student_data$Internet_Access == "No"][sample(sum(student_data$Internet_Access == "No"), 100)]
# Histogram of study hours by internet access
par(mfrow = c(1, 2))
hist(internet_access, main = "Internet Access", xlab = "Study Hours", col = "skyblue", border = "black")
hist(no_internet_access, main = "No Internet Access", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: internet_access
W = 0.99238, p-value = 0.8479
Shapiro-Wilk normality test
data: no_internet_access
W = 0.99029, p-value = 0.6878
Both the histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on internet access.
# ANOVA test for study hours by internet access
anova_internet_access <- aov(Hours_Studied ~ Internet_Access, data = student_data)
summary(anova_internet_access)
Df Sum Sq Mean Sq F value Pr(>F)
Internet_Access 1 11 11.39 0.318 0.573
Residuals 6375 228400 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on internet access, with a p-value greater than 0.05. This indicates that internet access does not have a significant impact on the number of hours students study per week.
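When the grouping factor has only two levels, as with internet access, a one-way ANOVA is equivalent to a two-sample t-test with pooled variance: the F statistic equals the squared t statistic and the p-values agree. The sketch below illustrates this on synthetic data, not on the report's dataset. The data frame `demo` and its values are made up for the demonstration.

```r
# Synthetic two-group data, for illustration only (not the report's dataset)
set.seed(7)
demo <- data.frame(
  Hours_Studied   = rnorm(200, mean = 20, sd = 6),
  Internet_Access = rep(c("Yes", "No"), times = c(150, 50))
)

# One-way ANOVA with a two-level factor
anova_table <- summary(aov(Hours_Studied ~ Internet_Access, data = demo))[[1]]
f_value <- anova_table[["F value"]][1]
p_anova <- anova_table[["Pr(>F)"]][1]

# Student's t-test with pooled variance (var.equal = TRUE)
t_fit <- t.test(Hours_Studied ~ Internet_Access, data = demo, var.equal = TRUE)

# The F statistic equals the squared t statistic, and the p-values agree
print(all.equal(f_value, unname(t_fit$statistic)^2))
print(all.equal(p_anova, t_fit$p.value))
```

Note that `t.test` defaults to the Welch version (`var.equal = FALSE`), which does not assume equal variances and will generally give a slightly different p-value.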
Next, we explore the average number of hours students study per week based on family income levels.
# Summary statistics for study hours by family income
summary(student_data$Hours_Studied[student_data$Family_Income == "Low"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.93 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.07 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 16.00 20.00 19.89 24.00 39.00
The summary statistics show that weekly study hours are very similar across family income levels, with group means between 19.89 and 20.07 hours. We will now use ANOVA to test for differences in study hours based on family income levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
low_family_income <- student_data$Hours_Studied[student_data$Family_Income == "Low"][sample(sum(student_data$Family_Income == "Low"), 100)]
medium_family_income <- student_data$Hours_Studied[student_data$Family_Income == "Medium"][sample(sum(student_data$Family_Income == "Medium"), 100)]
high_family_income <- student_data$Hours_Studied[student_data$Family_Income == "High"][sample(sum(student_data$Family_Income == "High"), 100)]
# Histogram of study hours by family income
par(mfrow = c(1, 3))
hist(low_family_income, main = "Low Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_family_income, main = "Medium Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
hist(high_family_income, main = "High Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: low_family_income
W = 0.99118, p-value = 0.7595
Shapiro-Wilk normality test
data: medium_family_income
W = 0.98624, p-value = 0.3883
Shapiro-Wilk normality test
data: high_family_income
W = 0.98195, p-value = 0.1881
In the output shown, the Shapiro-Wilk test does not reject normality for any of the three family income samples, with p-values of 0.76, 0.39, and 0.19, all greater than 0.05. Because each sample of 100 is drawn at random, these p-values vary from run to run, and a given run can flag a group as non-normal. As a check that does not rely on the normality assumption, we will now use the Kruskal-Wallis test to test for differences in study hours based on family income levels.
# Kruskal-Wallis test for study hours by family income
kruskal.test(Hours_Studied ~ Family_Income, data = student_data)
Kruskal-Wallis rank sum test
data: Hours_Studied by Family_Income
Kruskal-Wallis chi-squared = 1.0581, df = 2, p-value = 0.5892
The Kruskal-Wallis test shows that there is no significant difference in the number of hours students study per week across family income levels, with a p-value greater than 0.05. This indicates that family income does not have a significant impact on the number of hours students study per week.
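The normality checks in this section rely on random samples of 100 drawn with `sample()`, so if no seed is fixed, the Shapiro-Wilk p-values change every time the report is rendered. A minimal sketch of a reproducible version follows. The helper `shapiro_by_group`, the seed value, and the synthetic `demo` data frame are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: draw a fixed-size random sample from each group
# and return the Shapiro-Wilk p-value for each sample.
shapiro_by_group <- function(values, groups, n = 100, seed = 42) {
  set.seed(seed)  # a fixed seed makes the samples, and so the p-values, repeatable
  sapply(split(values, groups), function(x) {
    x <- x[!is.na(x)]
    s <- sample(x, min(n, length(x)))
    shapiro.test(s)$p.value
  })
}

# Synthetic stand-in for student_data, for illustration only
set.seed(1)
demo <- data.frame(
  Hours_Studied = round(rnorm(600, mean = 20, sd = 6)),
  Family_Income = rep(c("Low", "Medium", "High"), each = 200)
)

p_values <- shapiro_by_group(demo$Hours_Studied, demo$Family_Income)
print(p_values)
```

Calling the helper twice with the same seed returns identical p-values, which keeps the narrative and the printed output in agreement across renders.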
Next, we explore the average number of hours students study per week based on teacher quality levels.
# Summary statistics for study hours by teacher quality
summary(student_data$Hours_Studied[student_data$Teacher_Quality == "Low"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 16.00 20.00 20.05 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.92 24.00 44.00
The summary statistics show that weekly study hours are very similar across teacher quality levels, with group means between 19.92 and 20.05 hours. We will now use ANOVA to test for differences in study hours based on teacher quality levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
low_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "Low"][sample(sum(student_data$Teacher_Quality == "Low"), 100)]
medium_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "Medium"][sample(sum(student_data$Teacher_Quality == "Medium"), 100)]
high_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "High"][sample(sum(student_data$Teacher_Quality == "High"), 100)]
# Histogram of study hours by teacher quality
par(mfrow = c(1, 3))
hist(low_teacher_quality, main = "Low Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_teacher_quality, main = "Medium Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
hist(high_teacher_quality, main = "High Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: low_teacher_quality
W = 0.98158, p-value = 0.1763
Shapiro-Wilk normality test
data: medium_teacher_quality
W = 0.98216, p-value = 0.1952
Shapiro-Wilk normality test
data: high_teacher_quality
W = 0.98864, p-value = 0.5568
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on teacher quality levels.
# ANOVA test for study hours by teacher quality
anova_teacher_quality <- aov(Hours_Studied ~ Teacher_Quality, data = student_data)
summary(anova_teacher_quality)
Df Sum Sq Mean Sq F value Pr(>F)
Teacher_Quality 2 12 5.88 0.164 0.849
Residuals 6374 228400 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on teacher quality levels, with a p-value greater than 0.05. This indicates that teacher quality does not have a significant impact on the number of hours students study per week.
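The checks in this section address normality only. Classical one-way ANOVA also assumes that the groups have similar variances. One possible way to check this, sketched here on synthetic data rather than the report's dataset, is Bartlett's test from base R, with Welch's one-way test as an alternative that does not assume equal variances. The `demo` data frame is made up for the demonstration, and Bartlett's test is itself sensitive to non-normality, so this is only one option.

```r
# Synthetic three-group data, for illustration only (not the report's dataset)
set.seed(3)
demo <- data.frame(
  Hours_Studied   = rnorm(300, mean = 20, sd = 6),
  Teacher_Quality = rep(c("Low", "Medium", "High"), each = 100)
)

# Bartlett's test: the null hypothesis is equal variances across groups
bt <- bartlett.test(Hours_Studied ~ Teacher_Quality, data = demo)
print(bt$p.value)

# Welch's one-way test compares group means without assuming equal variances
welch <- oneway.test(Hours_Studied ~ Teacher_Quality, data = demo, var.equal = FALSE)
print(welch$p.value)
```

A small Bartlett p-value would suggest unequal variances, in which case the Welch result is the safer one to report.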
Next, we explore the average number of hours students study per week based on school type.
# Summary statistics for study hours by school type
summary(student_data$Hours_Studied[student_data$School_Type == "Public"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.98 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.97 24.00 44.00
The summary statistics show that weekly study hours are nearly identical for the two school types, with group means of 19.98 and 19.97 hours. We will now use ANOVA to test for differences in study hours based on school type. But first, we need to check the assumptions of ANOVA.
# Sample the data
public_school <- student_data$Hours_Studied[student_data$School_Type == "Public"][sample(sum(student_data$School_Type == "Public"), 100)]
private_school <- student_data$Hours_Studied[student_data$School_Type == "Private"][sample(sum(student_data$School_Type == "Private"), 100)]
# Histogram of study hours by school type
par(mfrow = c(1, 2))
hist(public_school, main = "Public School", xlab = "Study Hours", col = "skyblue", border = "black")
hist(private_school, main = "Private School", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: public_school
W = 0.98459, p-value = 0.2967
Shapiro-Wilk normality test
data: private_school
W = 0.98734, p-value = 0.4609
Both histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on school type.
# ANOVA test for study hours by school type
anova_school_type <- aov(Hours_Studied ~ School_Type, data = student_data)
summary(anova_school_type)
Df Sum Sq Mean Sq F value Pr(>F)
School_Type 1 0 0.25 0.007 0.934
Residuals 6375 228411 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on school type, with a p-value greater than 0.05. This indicates that school type does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on peer influence levels.
# Summary statistics for study hours by peer influence
summary(student_data$Hours_Studied[student_data$Peer_Influence == "Positive"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.06 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.95 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.91 24.00 39.00
The summary statistics show that weekly study hours are very similar across peer influence levels, with group means between 19.91 and 20.06 hours. We will now use ANOVA to test for differences in study hours based on peer influence levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
positive_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Positive"][sample(sum(student_data$Peer_Influence == "Positive"), 100)]
negative_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Negative"][sample(sum(student_data$Peer_Influence == "Negative"), 100)]
neutral_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Neutral"][sample(sum(student_data$Peer_Influence == "Neutral"), 100)]
# Histogram of study hours by peer influence
par(mfrow = c(1, 3))
hist(positive_peer_influence, main = "Positive Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
hist(negative_peer_influence, main = "Negative Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
hist(neutral_peer_influence, main = "Neutral Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: positive_peer_influence
W = 0.98724, p-value = 0.454
Shapiro-Wilk normality test
data: negative_peer_influence
W = 0.98803, p-value = 0.5107
Shapiro-Wilk normality test
data: neutral_peer_influence
W = 0.97778, p-value = 0.08907
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on peer influence levels.
# ANOVA test for study hours by peer influence
anova_peer_influence <- aov(Hours_Studied ~ Peer_Influence, data = student_data)
summary(anova_peer_influence)
Df Sum Sq Mean Sq F value Pr(>F)
Peer_Influence 2 29 14.33 0.4 0.67
Residuals 6374 228383 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on peer influence levels, with a p-value greater than 0.05. This indicates that peer influence does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on learning disability.
# Summary statistics for study hours by learning disability
summary(student_data$Hours_Studied[student_data$Learning_Disabilities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 16.00 20.00 19.73 24.00 35.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 16 20 20 24 44
The summary statistics show that weekly study hours are very similar for students with and without learning disabilities, with group means of 19.73 and 20 hours. We will now use ANOVA to test for differences in study hours based on learning disability. But first, we need to check the assumptions of ANOVA.
# Sample the data
learning_disabilities <- student_data$Hours_Studied[student_data$Learning_Disabilities == "Yes"][sample(sum(student_data$Learning_Disabilities == "Yes"), 100)]
no_learning_disabilities <- student_data$Hours_Studied[student_data$Learning_Disabilities == "No"][sample(sum(student_data$Learning_Disabilities == "No"), 100)]
# Histogram of study hours by learning disability
par(mfrow = c(1, 2))
hist(learning_disabilities, main = "Learning Disabilities", xlab = "Study Hours", col = "skyblue", border = "black")
hist(no_learning_disabilities, main = "No Learning Disabilities", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: learning_disabilities
W = 0.98009, p-value = 0.135
Shapiro-Wilk normality test
data: no_learning_disabilities
W = 0.97305, p-value = 0.03802
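The Shapiro-Wilk test does not reject normality for the sample of students with learning disabilities (p = 0.135), but it does reject normality at the 0.05 level for the sample of students without learning disabilities (p = 0.038). The departure is mild, and ANOVA is reasonably robust to moderate non-normality when the dataset is large, so we will now use ANOVA to test for differences in study hours based on learning disability.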
# ANOVA test for study hours by learning disability
anova_learning_disabilities <- aov(Hours_Studied ~ Learning_Disabilities, data = student_data)
summary(anova_learning_disabilities)
Df Sum Sq Mean Sq F value Pr(>F)
Learning_Disabilities 1 44 43.87 1.225 0.268
Residuals 6375 228367 35.82
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on learning disability, with a p-value greater than 0.05. This indicates that learning disabilities do not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on parental education levels.
# Summary statistics for study hours by parental education level
summary(student_data$Hours_Studied[student_data$Parental_Education_Level == "High School"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.05 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.87 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 16.00 20.00 19.97 24.00 39.00
The summary statistics show that weekly study hours are very similar across parental education levels, with group means between 19.87 and 20.05 hours. We will now use ANOVA to test for differences in study hours based on parental education levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_school_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "High School"][sample(sum(student_data$Parental_Education_Level == "High School"), 100)]
college_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "College"][sample(sum(student_data$Parental_Education_Level == "College"), 100)]
postgraduate_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "Postgraduate"][sample(sum(student_data$Parental_Education_Level == "Postgraduate"), 100)]
# Histogram of study hours by parental education level
par(mfrow = c(1, 3))
hist(high_school_education, main = "High School Education", xlab = "Study Hours", col = "skyblue", border = "black")
hist(college_education, main = "College Education", xlab = "Study Hours", col = "skyblue", border = "black")
hist(postgraduate_education, main = "Postgraduate Education", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_school_education
W = 0.98413, p-value = 0.2746
Shapiro-Wilk normality test
data: college_education
W = 0.97967, p-value = 0.1253
Shapiro-Wilk normality test
data: postgraduate_education
W = 0.98131, p-value = 0.168
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on parental education levels.
# ANOVA test for study hours by parental education level
anova_parental_education_level <- aov(Hours_Studied ~ Parental_Education_Level, data = student_data)
summary(anova_parental_education_level)
Df Sum Sq Mean Sq F value Pr(>F)
Parental_Education_Level 2 37 18.71 0.522 0.593
Residuals 6374 228374 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on parental education levels, with a p-value greater than 0.05. This indicates that parental education levels do not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on the distance from home to school.
# Summary statistics for study hours by distance from home (Near, Moderate, Far)
summary(student_data$Hours_Studied[student_data$Distance_from_Home == "Near"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 16.0 20.0 19.9 24.0 43.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.05 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 16.0 20.0 20.2 24.0 39.0
The summary statistics show that weekly study hours are very similar across distances from home to school, with group means between 19.9 and 20.2 hours. We will now use ANOVA to test for differences in study hours based on the distance from home to school. But first, we need to check the assumptions of ANOVA.
# Sample the data
near_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Near"][sample(sum(student_data$Distance_from_Home == "Near"), 100)]
moderate_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Moderate"][sample(sum(student_data$Distance_from_Home == "Moderate"), 100)]
far_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Far"][sample(sum(student_data$Distance_from_Home == "Far"), 100)]
# Histogram of study hours by distance from home
par(mfrow = c(1, 3))
hist(near_distance, main = "Near Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
hist(moderate_distance, main = "Moderate Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
hist(far_distance, main = "Far Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: near_distance
W = 0.97892, p-value = 0.1094
Shapiro-Wilk normality test
data: moderate_distance
W = 0.99076, p-value = 0.7258
Shapiro-Wilk normality test
data: far_distance
W = 0.98969, p-value = 0.6399
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on the distance from home to school.
# ANOVA test for study hours by distance from home
anova_distance_from_home <- aov(Hours_Studied ~ Distance_from_Home, data = student_data)
summary(anova_distance_from_home)
Df Sum Sq Mean Sq F value Pr(>F)
Distance_from_Home 2 67 33.48 0.934 0.393
Residuals 6374 228344 35.82
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on the distance from home to school, with a p-value greater than 0.05. This indicates that the distance from home to school does not have a significant impact on the number of hours students study per week.
Finally, we explore the average number of hours students study per week based on gender.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.94 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.02 24.00 44.00
The summary statistics show that weekly study hours are very similar for the two genders, with group means of 19.94 and 20.02 hours. We will now use ANOVA to test for differences in study hours based on gender. But first, we need to check the assumptions of ANOVA.
# Sample the data
male_data <- student_data$Hours_Studied[student_data$Gender == "Male"][sample(sum(student_data$Gender == "Male"), 100)]
female_data <- student_data$Hours_Studied[student_data$Gender == "Female"][sample(sum(student_data$Gender == "Female"), 100)]
# Histogram
par(mfrow = c(1, 2))
hist(male_data, main = "Male", xlab = "Study Hours", col = "skyblue", border = "black")
hist(female_data, main = "Female", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: male_data
W = 0.99016, p-value = 0.6773
Shapiro-Wilk normality test
data: female_data
W = 0.9903, p-value = 0.6889
Both histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for both groups. We will now use ANOVA to test for differences in study hours based on gender.
Df Sum Sq Mean Sq F value Pr(>F)
Gender 1 11 11.12 0.31 0.577
Residuals 6375 228400 35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study based on gender.
In summary, the average number of hours students study per week is approximately 19.98 hours. There are no significant differences in the average number of hours students study per week based on parental involvement, access to resources, extracurricular activities, motivation levels, internet access, family income, teacher quality, school type, peer influence, learning disabilities, parental education levels, distance from home to school, or gender. This indicates that the number of hours students study per week is consistent across these factors.
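Because the same test is repeated for many factors, it can help to run the tests in a loop and adjust the p-values for multiple comparisons. The sketch below shows one way to do this on synthetic data. The `demo` data frame, the choice of three factors, and the Holm adjustment are assumptions made for illustration, not the method used in this report.

```r
# Synthetic stand-in for student_data, for illustration only
set.seed(11)
n <- 300
demo <- data.frame(
  Hours_Studied    = rnorm(n, mean = 20, sd = 6),
  Motivation_Level = sample(c("Low", "Medium", "High"), n, replace = TRUE),
  School_Type      = sample(c("Public", "Private"), n, replace = TRUE),
  Gender           = sample(c("Male", "Female"), n, replace = TRUE)
)

factors <- c("Motivation_Level", "School_Type", "Gender")

# Run one ANOVA per factor and keep the p-value of the factor term
p_raw <- sapply(factors, function(f) {
  fit <- aov(reformulate(f, response = "Hours_Studied"), data = demo)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Holm adjustment controls the family-wise error rate across the tests
results <- data.frame(
  factor     = factors,
  p_value    = p_raw,
  p_adjusted = p.adjust(p_raw, method = "holm")
)
print(results)
```

With the real data, `factors` would list the categorical columns of `student_data`, and the adjusted p-values would be the ones to compare against 0.05.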
Now we analyze the variation in attendance rates across students.
# Sample the data
sample_attendance <- student_data$Attendance[sample(nrow(student_data), 1000)]
# Summary statistics for attendance rates
summary(sample_attendance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.0 70.0 81.0 80.8 92.0 100.0
# Histogram of attendance rates
hist(sample_attendance, main = "Distribution of Attendance Rates", xlab = "Attendance Rate", col = "skyblue", border = "black")
From the histogram we can see that the distribution of attendance rates is approximately uniform, ranging from 60% to 100%, with one slightly higher peak around 70% to 80%. Attendance rates are therefore spread fairly evenly across students. The average attendance rate in this sample is approximately 80.8% (79.98% in the full dataset), which is relatively high. We now investigate the factors that may influence attendance rates.
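The impression that attendance rates are roughly uniform can be checked more formally with a chi-squared goodness-of-fit test. The sketch below uses synthetic data and assumes that attendance is recorded as whole-number percentages from 60 to 100. The vector `attendance` is a made-up stand-in for `student_data$Attendance`.

```r
# Synthetic stand-in for the attendance column, for illustration only
set.seed(5)
attendance <- sample(60:100, 2000, replace = TRUE)

# Count how often each whole-number rate from 60 to 100 occurs
counts <- table(factor(attendance, levels = 60:100))

# Goodness-of-fit test; by default every value has equal expected probability
gof <- chisq.test(counts)
print(gof$p.value)
```

A p-value above 0.05 would be consistent with a uniform distribution, while a small p-value would point to peaks or gaps such as the one noted around 70% to 80%.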
We will now explore the attendance rates based on parental involvement levels.
# Summary statistics for attendance rates by parental involvement
summary(student_data$Attendance[student_data$Parental_Involvement == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.01 90.00 100.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.0 70.0 80.0 79.9 90.0 100.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.32 91.00 100.00
The summary statistics show that attendance rates are very similar across parental involvement levels, with group means between 79.9% and 80.32%. We will now use ANOVA to test for differences in attendance rates based on parental involvement levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of attendance rates by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement_attendance, main = "High Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(medium_parental_involvement_attendance, main = "Medium Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(low_parental_involvement_attendance, main = "Low Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_parental_involvement_attendance
W = 0.9466, p-value = 0.0004986
Shapiro-Wilk normality test
data: medium_parental_involvement_attendance
W = 0.92294, p-value = 2.023e-05
Shapiro-Wilk normality test
data: low_parental_involvement_attendance
W = 0.928, p-value = 3.842e-05
All three histograms and the Shapiro-Wilk tests show that the distribution of attendance rates departs from normality in every group. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on parental involvement levels.
Kruskal-Wallis rank sum test
data: Attendance by Parental_Involvement
Kruskal-Wallis chi-squared = 1.2171, df = 2, p-value = 0.5441
The Kruskal-Wallis test shows that there is no significant difference in attendance rates across parental involvement levels, with a p-value greater than 0.05. This indicates that parental involvement does not have a significant impact on attendance rates.
Next, we explore the attendance rates based on access to resources.
# Summary statistics for attendance rates by access to resources
summary(student_data$Attendance[student_data$Access_to_Resources == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 79.00 79.87 90.00 100.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
60 70 80 80 90 100
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 71.00 80.00 80.29 91.00 100.00
The summary statistics show that attendance rates are very similar across levels of access to resources, with group means between 79.87% and 80.29%. We will now use ANOVA to test for differences in attendance rates based on access to resources. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of attendance rates by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources_attendance, main = "High Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(medium_access_to_resources_attendance, main = "Medium Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(low_access_to_resources_attendance, main = "Low Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_access_to_resources_attendance
W = 0.95112, p-value = 0.0009836
Shapiro-Wilk normality test
data: medium_access_to_resources_attendance
W = 0.94251, p-value = 0.0002752
Shapiro-Wilk normality test
data: low_access_to_resources_attendance
W = 0.93998, p-value = 0.0001923
All three histograms and the Shapiro-Wilk tests show that the distribution of attendance rates departs from normality in every group. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on access to resources.
Kruskal-Wallis rank sum test
data: Attendance by Access_to_Resources
Kruskal-Wallis chi-squared = 1.0494, df = 2, p-value = 0.5917
The Kruskal-Wallis test shows that there is no significant difference in attendance rates across levels of access to resources, with a p-value greater than 0.05. This indicates that access to resources does not have a significant impact on attendance rates.
Next, we explore the attendance rates based on participation in extracurricular activities.
# Summary statistics for attendance rates by extracurricular activities
summary(student_data$Attendance[student_data$Extracurricular_Activities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60 70 80 80 90 100
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.05 90.00 100.00
The summary statistics show that attendance rates are nearly identical for students who do and do not participate in extracurricular activities, with group means of 80% and 80.05%. We will now use ANOVA to test for differences in attendance rates based on participation in extracurricular activities. But first, we need to check the assumptions of ANOVA.
# Sample the data
participate_extracurricular_attendance <- student_data$Attendance[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular_attendance <- student_data$Attendance[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Histogram of attendance rates by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular_attendance, main = "Extracurricular Activities", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular_attendance, main = "No Extracurricular Activities", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: participate_extracurricular_attendance
W = 0.96044, p-value = 0.004321
Shapiro-Wilk normality test
data: do_not_participate_extracurricular_attendance
W = 0.93609, p-value = 0.0001122
Both histograms and the Shapiro-Wilk tests show that the distribution of attendance rates departs from normality in both groups. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on participation in extracurricular activities.
Kruskal-Wallis rank sum test
data: Attendance by Extracurricular_Activities
Kruskal-Wallis chi-squared = 0.026414, df = 1, p-value = 0.8709
The Kruskal-Wallis test shows that there is no significant difference in attendance rates between students who do and do not participate in extracurricular activities, with a p-value greater than 0.05. This indicates that participation in extracurricular activities does not have a significant impact on attendance rates.
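With only two groups, the Kruskal-Wallis test gives the same p-value as the Wilcoxon rank-sum test when the latter uses the normal approximation without continuity correction. The sketch below illustrates this on synthetic data, not on the report's dataset. The `demo` data frame is made up for the demonstration.

```r
# Synthetic two-group attendance data, for illustration only
set.seed(9)
demo <- data.frame(
  Attendance = sample(60:100, 200, replace = TRUE),
  Extracurricular_Activities = rep(c("Yes", "No"), times = c(120, 80))
)

kw <- kruskal.test(Attendance ~ Extracurricular_Activities, data = demo)

# Wilcoxon rank-sum test with the normal approximation and no continuity correction
wx <- wilcox.test(Attendance ~ Extracurricular_Activities, data = demo,
                  exact = FALSE, correct = FALSE)

# The two p-values agree up to floating-point error
print(all.equal(kw$p.value, wx$p.value))
```

For two-level factors such as extracurricular participation, either test can therefore be reported. By default `wilcox.test` applies a continuity correction, so its p-value will differ slightly unless `correct = FALSE` is set.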